Abstract. The Boltzmann distribution used in the steady-state analysis of the simulated
annealing algorithm gives rise to several scale invariant properties. Scale invariance is first
presented in the context of parallel independent processors and then extended to an abstract
form based on lumping states together to form new aggregate states. These lumped or aggregate
states possess all of the mathematical characteristics, forms and relationships of states
(solutions) in the original problem in both first and second moments. These scale invariance
properties therefore permit new ways of relating objective function values, conditional
expectation values, stationary probabilities, rates of change of stationary probabilities and
conditional variances. Such properties therefore provide potential applications in analysis,
statistical inference and optimization. Directions for future research that take advantage of
scale invariance are also discussed.

1. Introduction

In recent years, the concept of scale invariance, often described in terms of
self-similarity, has received much attention. It describes phenomena as disparate
as the structure of a snowflake and the behavior of the stock market
(Mandelbrot, 1983). These two cases represent the extremes with which scale
invariance is apparent. The repeating geometrical patterns at different scales
often found in nature are quite compelling even while the mathematics that
fully describes this scale invariance is quite arcane. At the other extreme is
the scale invariance associated with stochastic and diffusion processes. In this
realm, scale invariance departs from the visual and wends its way into the abstract,
where the probabilistic nature of sample paths becomes the cornerstone
of pattern finding.

The notion of pattern finding is key in discovering and utilizing the concept
of scale invariance. Thus, the perspective of an analyst is paramount:
finding  any  pattern  depends  on  how  you  look  at  things.  Indeed,  it  is  often 
possible  to  find  scale  invariance  in  almost  any  phenomenon  if  one  stretches 
definitions  sufficiently;  whether  such  scale  invariance  is  useful  or  even  real 
depends  on  the  vantage  point  from  which  such  scale  invariance  is  discovered 
and  whether  interesting  patterns  persist  in  related  areas. 

This  paper  identifies  a  form  of  scale  invariance  in  the  simulated  annealing 
(SA) algorithm. This scale invariance is manifest in several ways. Fleischer
(1999) shows scale invariance associated with parallel, independent processors.
This article describes a scale invariance based on lumping solution states
together to form aggregate states.

To  fully  describe  this  scale  invariance  requires  a  definition  of  the  type  of 
scales  used.  Section  2  provides  this  necessary  background  and  illustrates  a 
scale  invariance  with  respect  to  a  system  of  independent  processors  thereby 
providing  the  basis  for  comparisons.  It  also  describes  the  indexing  method 
used in conjunction with aggregate states. These methods are then used in
Section  3  to  show  scale  invariance  in  SA  between  individual  solution  states 
and aggregate solution states. Section 4 describes potential applications in the
areas of analysis, statistical inference and optimization. Finally, Section 5
summarizes this article, describes areas of future research, and offers
some concluding remarks.


2.  Background  and  Motivation 

The concept of scale invariance in the literature on dynamical systems pertains
to phenomena that retain some property at different scales. Demonstrating
this property therefore requires a comparison of some phenomenon
at different scales and, hence, an appropriate description of the phenomenon


lumpingRevlb.tex;  26/03/2002;  13:55;  p.2 


Scale  Invariance  Properties  in  the  Simulated  Annealing  Algorithm 




that  is  being  compared  and  also  a  description  of  the  type  of  scaling  used.  In 
other  words,  it  must  be  shown  that  something  is  similar  to  something  else  even 
though  its  scale  or  definition  is  different.  The  following  equations  provide  the 
foundations  for  these  comparisons  and  are  well  known  (see  e.g.,  (Mitra  et  al., 
1986;  Aarts  and  Korst,  1989))  in  the  context  of  SA.  These  results  concern 
the  stationary  probabilities  associated  with  state  i  in  a  discrete  optimization 
problem. 

If the SA algorithm is executed at some fixed temperature $t$, then the frequency
of visits to some particular state $i \in \Omega$, where $\Omega$ is the state space, is
given by the stationary probability distribution

\[
\pi_i(t) = \frac{e^{-f_i/t}}{\sum_{j \in \Omega} e^{-f_j/t}} \tag{1}
\]

where $f_i$ is the objective function value associated with state $i$. Mitra et al.
(1986) show that the rate of change of the stationary probability associated
with some state $i$ with respect to the temperature parameter $t$ is

\[
\frac{\partial \pi_i(t)}{\partial t} = \frac{\pi_i(t)}{t^2} \left[ f_i - \langle f \rangle(t) \right] \tag{2}
\]

(Mitra et al., 1986; Aarts and Korst, 1989) where $\langle f \rangle(t)$ is the expected
objective function value at temperature $t$.

Mitra et al. also show how this rate of change depends on the quantity in brackets
(Mitra et al., 1986, p.755-6). For an optimal state $i^*$, $f_{i^*} < \langle f \rangle(t)$ for
$t > 0$, hence the derivative is negative. Consequently, the stationary probabilities
of the optima monotonically increase with decreases in temperature
values.
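
The relationship between (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the objective function values in `f`, the temperature `t`, and the helper names `boltzmann` and `dpi_dt` are all hypothetical choices made for illustration. It compares a central finite difference of (1) against the closed form in (2).

```python
import math

def boltzmann(f, t):
    """Stationary probabilities from (1): pi_i(t) = exp(-f_i/t) / sum_j exp(-f_j/t)."""
    f_min = min(f)  # shifting by f_min avoids overflow and cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def dpi_dt(f, t):
    """Right-hand side of (2): pi_i(t) / t^2 * (f_i - <f>(t))."""
    pi = boltzmann(f, t)
    mean_f = sum(p * fi for p, fi in zip(pi, f))
    return [p * (fi - mean_f) / t**2 for p, fi in zip(pi, f)]

f = [1.0, 2.0, 2.0, 4.0, 7.0]   # hypothetical objective function values
t, h = 1.5, 1e-5                # hypothetical temperature and step size

# Central finite difference of (1) with respect to t.
numeric = [(hi - lo) / (2 * h)
           for hi, lo in zip(boltzmann(f, t + h), boltzmann(f, t - h))]
analytic = dpi_dt(f, t)
max_gap = max(abs(n - a) for n, a in zip(numeric, analytic))
```

In this example the first state is the unique optimum, so its entry in `analytic` is negative, which is the sign argument made in the paragraph above.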

Equations  (1)  and  (2)  form  the  basis  of  the  scale  invariance  first  described 
in  (Fleischer,  1993),  and  further  developed  in  this  paper.  These  equations  have 
been  used  in  many  of  the  theoretical  results  on  the  convergence  of  SA  and  its 
finite-time  performance,  yet  the  scale  invariance  associated  with  (1)  and  (2) 
has  yet  to  be  fully  described  and  exploited. 

The next section provides some background on the concept of scale invariance,
the definitions that are needed for identifying aggregate states, and
describes the scale invariance in SA using several lemmas and theorems.


3.  Scale  Invariance  in  SA 

The  term  scale  invariance  is  usually  employed  to  describe  phenomena  and 
properties  that  seem  to  exist  or  persist  on  different  scales.  These  scales  can 
be  the  physical  dimensions  of  an  object,  time,  or  some  other  property  that  is 
associated with the phenomenon of interest. Indeed, in self-similar systems,
as they are sometimes called, it is often impossible to determine the scale of
the relevant phenomenon in question merely by observing it. For example, in
diffusion processes (such as Wiener processes), the scale of the physical dimension
of sample paths cannot be ascertained simply by viewing the sample
paths, as they appear and mathematically behave the same at any scale (Ross,
1970).  The  same  is  true  in  fractal  geometry:  the  patterns  that  emerge  from  the 
application  of  recursive  functionals  are  repeated  at  all  levels  of  magnification 
or  scale  (Mandelbrot,  1983).  The  system  on  one  scale  appears  like  itself  on 
another  scale.  Such  properties,  whether  they  are  the  behavior  or  the  attributes 
of  some  system  that  are  invariant  in  terms  of  scale  indicate  some  form  of  scale 
invariance. 

The  foregoing  description  of  scale  invariance  is  unfortunately  rather  vague 
and  a  more  concrete  description  of  scale  invariance  is  desirable.  A  more 
appealing  way  to  define  scale  invariance  is  in  the  following  abstract  terms: 
if statement $A$ implies $B$, then scale invariance exists if a transformation
applied to $A$ resulting in $A'$, and applied to $B$ resulting in $B'$, implies that
$A'$ implies $B'$. This definition suggests that exploring valid examples of scale
invariance  requires  reasonable  definitions  of  various  mathematical  elements 
and  a  showing  of  how  they  relate  to  those  mathematical  quantities  that  reflect 
scale  invariance. 

In SA, this scale is not based on physical dimensions or time, but is
more abstract and relates to the states of discrete optimization problems. The
scale is based on the level of aggregation of states, be it in terms of the states of
several processors or in terms of the states in a single processor system. The
invariant properties associated with these aggregate states involve the relationships
between their stationary probabilities, objective function values, the
rate  of  change  of  their  stationary  probabilities,  and  the  variance  of  objective 
function  values  when  the  SA  algorithm  is  applied  to  a  discrete  optimization 
problem.  Before  exploring  the  scale  invariance  of  aggregated  states,  however, 
a  brief  description  of  scale  invariance  in  the  context  of  parallel  processing  is 
presented. 

3.1. Parallel Processing in SA

To  motivate  the  notion  of  aggregating  states  in  SA,  consider  a  system  of 
parallel  processors  each  running  the  SA  algorithm  on  a  given  combinatorial 
optimization  problem.  Such  a  system  of  processors  also  exhibits  a  form  of 
scale invariance (see (Fleischer, 1999)) in that the stationary probabilities and
the rate of change of the stationary probabilities associated with a system state
have the same form as the analogous quantities in a single processor system.
For a system of $p$ independent processors, each of which is in some state
$i$, the system state can be represented in a product space by $i_1, i_2, \ldots, i_p$
and its stationary probability represented as $\pi_{i_1, i_2, \ldots, i_p}(t)$. Fleischer (1999)
showed that

\[
\pi_{i_1, i_2, \ldots, i_p}(t) = \frac{e^{-f_{i_1, i_2, \ldots, i_p}/t}}{\sum_{j_1, j_2, \ldots, j_p} e^{-f_{j_1, j_2, \ldots, j_p}/t}} \tag{3}
\]

where $f_{i_1, i_2, \ldots, i_p} = \sum_{m=1}^{p} f_{i_m}$ represents the system's objective function
value (the sum of the objective function values associated with each processor).
Note that the form of (3) is similar to (1).

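The following equations extend this result by showing that the rate of
change of the stationary probability with respect to temperature $t$ for $p$ parallel
processors has the same form as the rate of change of the stationary
probability associated with a single processor. Taking the derivative of (3)
with respect to temperature $t$,

\[
\begin{aligned}
\frac{\partial \pi_{i_1, i_2, \ldots, i_p}(t)}{\partial t}
&= \frac{\partial}{\partial t} \left[ \frac{e^{-f_{i_1, \ldots, i_p}/t}}{\sum_{j_1, \ldots, j_p} e^{-f_{j_1, \ldots, j_p}/t}} \right] \\
&= \frac{\pi_{i_1, \ldots, i_p}(t)}{t^2} \left[ \sum_{m=1}^{p} f_{i_m} - \sum_{m=1}^{p} \langle f_{i_m} \rangle(t) \right] \\
&= \frac{\pi_{i_1, \ldots, i_p}(t)}{t^2} \left[ f_{i_1, \ldots, i_p} - \langle f_{i_1, \ldots, i_p} \rangle(t) \right]
\end{aligned} \tag{4}
\]

where (4) is similar to (2). This similarity is apparent simply because of
how $f_{i_1, i_2, \ldots, i_p}$ has been defined. Thus, by making meaningful and logical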
definitions  of  other  elements  associated  with  SA  it  is  possible  to  extend  the 
similarity  apparent  from  the  aggregation  of  states  of  multiple  processors  to 
the  aggregation  of  states  in  a  discrete  optimization  problem. 
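
The product-space form in (3) can be illustrated with a small numerical sketch. The example below assumes two independent processors running SA on the same three-state problem; the objective values, the temperature, and the helper name `boltzmann` are hypothetical. It checks that the Boltzmann form applied to the summed objective values gives the same system-state probabilities as multiplying the single-processor probabilities.

```python
import itertools
import math

def boltzmann(f, t):
    """Boltzmann stationary probabilities for objective values f at temperature t."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

f = [1.0, 3.0, 4.0]   # hypothetical single-processor objective values
p, t = 2, 2.0         # hypothetical number of processors and temperature

single = boltzmann(f, t)
states = list(itertools.product(range(len(f)), repeat=p))  # system states (i_1, ..., i_p)

# System objective value: the sum of the per-processor objective values.
system_f = [sum(f[i] for i in s) for s in states]

# Form (3): the Boltzmann distribution applied directly to the system objective values.
system_pi = boltzmann(system_f, t)

# Independence: the joint probability is the product of single-processor probabilities.
product_pi = [math.prod(single[i] for i in s) for s in states]

max_gap = max(abs(a - b) for a, b in zip(system_pi, product_pi))
```

The agreement follows because the exponential of a sum factors into a product of exponentials, which is the reason the summed objective value is the natural definition for the system state.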


3.2.  Aggregating  States  in  SA 

To  show  how  lumping  states  together  into  an  aggregate  state  exhibits  scale 
invariance,  it  is  necessary  to  identify  these  aggregate  states.  This  requires 
some  method  for  indexing  these  states  so  they  can  be  uniquely  identified. 
How  this  is  done  is  crucial  towards  demonstrating  scale  invariance. 

In  many  discrete  optimization  problems,  the  index  associated  with  a  state 
is  either  arbitrary  and  merely  used  to  distinguish  between  states  (such  as  in  a 
proof)  or  used  to  indicate  some  other  information  about  the  state  it  represents. 
In  such  a  case,  some  specific  attribute  that  not  only  uniquely  describes  the 
particular  state  but  also  provides  other  useful  information  must  be  devised.  In 
SA,  and  in  particular,  in  terms  of  the  stationary  probability  associated  with 
states,  this  is  often  done  by  using  an  arbitrary  i  for  a  non-optimal  state  and 
an  i*  for  an  optimal  state.  This  index  is  of  limited  use  however  for  aggregate 
states  as  more  information  than  simply  distinguishing  an  optimal  state  from 
other states is necessary.

In lumping states together, not only is some method of designating them
required, but their objective function values must also be designated. The notion
that such aggregate states have an objective function value must therefore
be considered. The following definitions provide the indexing conventions
used in the next several sections. These conventions denote states, collections
of states, stationary probabilities, and objective function values, and are based
on both arbitrary denotations and denotations reflecting some ordering of
objective function values.

3.3.  Definitions 

The following definitions are needed to show how lumped states have the same
self-similar properties as individual states. These basic definitions are used later
to define new characteristics of lumped states such as their indices, objective
function values and stationary probabilities.

Definitions:

$\Omega$ the entire set of states in a discrete optimization problem.

$i$ an arbitrary index that identifies a particular state in a discrete optimization problem.

$\pi_i(t)$ the stationary probability of state $i$ at temperature $t$.

$f_i$ the objective function value associated with state $i$.

$F$ a random objective function value produced by the SA algorithm. Thus, its
probability distribution is $\Pr\{F = f_i\} = \pi_i(t)$.

$\langle f \rangle(t) = \sum_{i \in \Omega} \pi_i(t) f_i$, the expected objective function value at temperature $t$.

3.4.  Aggregating  States 

To show scale invariance based on lumping states together to form aggregate
states, these aggregate states must have attributes and mathematical relationships
similar to those associated with individual states. It is therefore
necessary to assign stationary probabilities and objective function values to
these states based on some reasonable and logical criteria and then investigate
their relationships to determine whether they are similar to the corresponding
relationships associated with individual states.

3.4.1.  Stationary  Probabilities 

Let $A = \{i_1, i_2, \ldots, i_m\}$ be some arbitrary set of $m$ states where $A \subset \Omega$.
A reasonable approach for defining the stationary probability of $A$ is based
on the frequency of occurrence of any state in the set $A$. Using indicator
variables, define

\[
1_A(t) = \begin{cases} 1 & \text{if the current state } i \in A \text{ at temperature } t \\ 0 & \text{otherwise} \end{cases}
\]

Note that for each state $i$, $1_i(t) = 1\ (0)$ if the current state in SA at temperature
$t$ is (not) $i$. Thus, the stationary probability may be defined using
indicator variables and the law of total probability:

\[
\begin{aligned}
\pi_A(t) &= E\{1_A(t)\} = \Pr\{1_A(t) = 1\} \\
&= \sum_{i \in A} \pi_i(t) = \sum_{l=1}^{m} \pi_{i_l}(t)
\end{aligned} \tag{5}
\]

These definitions formalize the notion that when the SA algorithm visits
any state in $A$, the algorithm visits $A$ itself, i.e., the frequency of visits to $A$,
$\pi_A(t)$, is the sum of the frequency of visits to states in $A$.


3.4.2.  Objective  Function  Values 

Defining  objective  function  values  for  aggregate  states  is  not  quite  as  simple 
as  in  the  case  involving  stationary  probabilities.  In  this  case,  a  visit  to  a  state  in 
A  gives  A  an  associated  objective  function  value  that  may  differ  from  that  of  a 
previous  visit.  Thus,  instead  of  simply  counting  visits  to  any  state  to  establish 
the  relative  frequency  of  that  state,  the  attribute  of  each  state — its  objective 
function  value — must  be  taken  into  account.  Note  also  that  the  frequency  of 
visits  to  each  element  of  set  A  is  dependent  on  the  temperature  t. 

Given that the objective function values associated with set $A$ may vary
over the course of an SA experiment, the most reasonable approach for assigning
an objective function value is to take the time average of the objective
function values obtained over the course of visits to set $A$. This suggests an
expression based on conditional expectation. Define the objective function
value of lumped node $A$ by

\[
\begin{aligned}
f_A(t) &= E\{F \mid 1_A(t) = 1\} \\
&= \sum_{i \in A} f_i \Pr\{F = f_i \mid 1_A(t) = 1\} \\
&= \sum_{i \in A} f_i \, \frac{\Pr\{F = f_i,\ 1_A(t) = 1\}}{\Pr\{1_A(t) = 1\}} \\
&= \frac{\sum_{l=1}^{m} \pi_{i_l}(t) f_{i_l}}{\sum_{l=1}^{m} \pi_{i_l}(t)} = \sum_{i \in A} \frac{\pi_i(t) f_i}{\pi_A(t)}
\end{aligned} \tag{6}
\]

Thus,  the  objective  function  value  of  set  A  is  the  weighted  average  or  convex 
combination  of  the  objective  function  values  of  the  states  it  contains,  i.e., 




Fleischer  and  Jacobson 


the expected objective function value of states in set $A$. Note that using the
definition in (6), the identity

\[
f_\Omega(t) = \langle f \rangle(t)
\]

holds for all $t > 0$.
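
The definitions of the aggregate stationary probability in (5) and the aggregate objective function value in (6) translate directly into code. The sketch below is illustrative only: the objective values, the temperature, the set `A`, and the helper names `pi_agg` and `f_agg` are hypothetical. It also confirms the identity that the aggregate objective value of the whole state space equals the expected objective function value.

```python
import math

def boltzmann(f, t):
    """Stationary probabilities from (1)."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def pi_agg(A, pi):
    """Definition (5): stationary probability of aggregate state A."""
    return sum(pi[i] for i in A)

def f_agg(A, pi, f):
    """Definition (6): probability-weighted average of objective values in A."""
    return sum(pi[i] * f[i] for i in A) / pi_agg(A, pi)

f = [1.0, 2.0, 2.0, 4.0, 7.0]   # hypothetical objective function values
t = 1.5                          # hypothetical temperature
pi = boltzmann(f, t)

omega = set(range(len(f)))       # the full state space, indexed 0..n-1
A = {1, 3, 4}                    # a hypothetical aggregate state

mean_f = sum(p * fi for p, fi in zip(pi, f))   # <f>(t)
f_of_A = f_agg(A, pi, f)
f_of_omega = f_agg(omega, pi, f)
```

Because `f_agg` is a convex combination, `f_of_A` always lies between the smallest and largest objective values of the states in `A`, and it changes with `t` even though the individual values do not.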


3.4.3.  Consistency 

For scale invariance to exist, the values for $\pi_A(t)$, its rate of change, and $f_A(t)$
must have relationships similar to those of the corresponding values for individual
states. Further, true scale invariance must be consistent, that is, it must be
evident at all scales. This means that aggregations of aggregate states should
obey the same relational rules. To illustrate, let $A$ and $B$ be disjoint aggregate
states. Then from (6), the objective function value of aggregate state $A \cup B$ is

\[
\begin{aligned}
f_{A \cup B}(t) &= \sum_{i \in A \cup B} \frac{\pi_i(t) f_i}{\pi_{A \cup B}(t)} \\
&= \sum_{i \in A} \frac{\pi_i(t) f_i}{\pi_{A \cup B}(t)} + \sum_{i \in B} \frac{\pi_i(t) f_i}{\pi_{A \cup B}(t)} \\
&= \frac{\pi_A(t) f_A(t) + \pi_B(t) f_B(t)}{\pi_A(t) + \pi_B(t)}
\end{aligned} \tag{7}
\]

where (7) is obtained by dividing and multiplying each summation by $\pi_A(t)$
and $\pi_B(t)$, respectively, and using the definitions of $f_A(t)$ and $f_B(t)$. Scale
invariance is manifest in (7) because this equation for the objective function value
of unions of disjoint aggregate states has the same formulation in terms of the
aggregate states as (6) has in terms of individual states.
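
The consistency property in (7) can be seen in a short numerical sketch: the objective value of a union of disjoint aggregate states computed state by state agrees with the value computed only from the aggregate quantities. The data, the sets `A` and `B`, and the helper names are hypothetical.

```python
import math

def boltzmann(f, t):
    """Stationary probabilities from (1)."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def pi_agg(A, pi):
    """Definition (5)."""
    return sum(pi[i] for i in A)

def f_agg(A, pi, f):
    """Definition (6)."""
    return sum(pi[i] * f[i] for i in A) / pi_agg(A, pi)

f = [1.0, 2.0, 2.0, 4.0, 7.0]   # hypothetical objective function values
pi = boltzmann(f, 1.5)          # hypothetical temperature t = 1.5

A, B = {0, 3}, {1, 4}           # hypothetical disjoint aggregate states

# First line of (7): treat A ∪ B as a set of individual states.
direct = f_agg(A | B, pi, f)

# Last line of (7): use only the aggregate probabilities and objective values.
pa, pb = pi_agg(A, pi), pi_agg(B, pi)
via_aggregates = (pa * f_agg(A, pi, f) + pb * f_agg(B, pi, f)) / (pa + pb)
```

The second computation never looks inside `A` or `B`, which is the sense in which aggregate states can themselves be treated as states at a coarser scale.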


3.4.4.  Scale  Invariant  Relationships 

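These scale invariant relationships become further apparent using (5) and (6)
to determine the rate of change of $\pi_A(t)$ with respect to $t$,

\[
\begin{aligned}
\frac{\partial \pi_A(t)}{\partial t} &= \frac{\partial}{\partial t} \sum_{l=1}^{m} \pi_{i_l}(t) && (8) \\
&= \sum_{l=1}^{m} \frac{\partial \pi_{i_l}(t)}{\partial t} = \sum_{l=1}^{m} \frac{\pi_{i_l}(t)}{t^2} \left[ f_{i_l} - \langle f \rangle(t) \right] && (9) \\
&= \sum_{l=1}^{m} \frac{\pi_{i_l}(t) f_{i_l}}{t^2} - \sum_{l=1}^{m} \frac{\pi_{i_l}(t) \langle f \rangle(t)}{t^2} && (10)
\end{aligned}
\]

where (2) is substituted into (8) for each $i_l$ to yield (9). Noting the definitions
in (5) and (6) and substituting these expressions into (10) yields

\[
\begin{aligned}
\frac{\partial \pi_A(t)}{\partial t} &= \frac{\pi_A(t) f_A(t)}{t^2} - \frac{\pi_A(t) \langle f \rangle(t)}{t^2} \\
&= \frac{\pi_A(t)}{t^2} \left[ f_A(t) - \langle f \rangle(t) \right].
\end{aligned} \tag{11}
\]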

The similarities in (2) and (11) suggest a scale invariant structure due to the
parallel relationships between individual states and aggregate states in terms
of their stationary probabilities, derivatives with respect to temperature, and
objective function values. The following lemma and its corollaries expand
on these relationships in a property referred to as objective function complementarity.
This lemma establishes a general relationship between aggregate
states, their complement aggregate states, and their objective function values.
The following definition is needed:

Definition. Aggregate states $A$ and $\Omega \setminus A$ are said to be complementary
states. If $A \subset B \subset \Omega$ then sets $A$ and $B \setminus A$ are said to be complementary
relative to set $B$ or simply are relative complements with respect to $B$.

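LEMMA 1. (Objective Function Complementarity) Given any non-empty
aggregate states $A$ and $\Omega \setminus A$, for all $t > 0$,

\[
f_A(t) - \langle f \rangle(t) = f_A(t) - f_\Omega(t) = \pi_{\Omega \setminus A}(t) \left[ f_A(t) - f_{\Omega \setminus A}(t) \right] \tag{12}
\]

Proof. From the definitions of $f_A(t)$ and $\langle f \rangle(t)$,

\[
\begin{aligned}
f_A(t) - \langle f \rangle(t) &= f_A(t) - \left( \sum_{i \in A} \pi_i(t) f_i + \sum_{i \in \Omega \setminus A} \pi_i(t) f_i \right) \\
&= f_A(t) - \pi_A(t) \frac{\sum_{i \in A} \pi_i(t) f_i}{\pi_A(t)} - \pi_{\Omega \setminus A}(t) \frac{\sum_{i \in \Omega \setminus A} \pi_i(t) f_i}{\pi_{\Omega \setminus A}(t)} \\
&= f_A(t) - \pi_A(t) f_A(t) - \pi_{\Omega \setminus A}(t) f_{\Omega \setminus A}(t) \\
&= \left[ 1 - \pi_A(t) \right] f_A(t) - \pi_{\Omega \setminus A}(t) f_{\Omega \setminus A}(t)
\end{aligned}
\]

Since $A$ and $\Omega \setminus A$ are complement sets, $1 - \pi_A(t) = \pi_{\Omega \setminus A}(t)$ and the
result follows. ■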
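
As a sanity check on Lemma 1, the identity in (12) can be evaluated for every non-empty proper subset of a small state space. This is only a sketch under assumed data: the objective values, the temperature, and the helper names are hypothetical.

```python
import itertools
import math

def boltzmann(f, t):
    """Stationary probabilities from (1)."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def pi_agg(A, pi):
    """Definition (5)."""
    return sum(pi[i] for i in A)

def f_agg(A, pi, f):
    """Definition (6)."""
    return sum(pi[i] * f[i] for i in A) / pi_agg(A, pi)

f = [1.0, 2.0, 2.0, 4.0, 7.0]   # hypothetical objective function values
pi = boltzmann(f, 1.5)          # hypothetical temperature t = 1.5
n = len(f)
omega = set(range(n))
mean_f = sum(p * fi for p, fi in zip(pi, f))   # <f>(t)

max_gap = 0.0
checked = 0
for size in range(1, n):                        # non-empty proper subsets only
    for combo in itertools.combinations(range(n), size):
        A = set(combo)
        comp = omega - A                        # the complementary state
        lhs = f_agg(A, pi, f) - mean_f
        rhs = pi_agg(comp, pi) * (f_agg(A, pi, f) - f_agg(comp, pi, f))
        max_gap = max(max_gap, abs(lhs - rhs))
        checked += 1
```

An exhaustive loop is feasible only for very small state spaces; it is used here because it exercises the lemma for arbitrary subsets rather than for a single hand-picked one.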

Observe in the lemma that the aggregate state $A$ is obviously contained in
$\Omega$. This provides a clue for generalizing the property of Objective Function
Complementarity by considering aggregate states that have some complementary
relationship with respect to some subset of $\Omega$. This generalization is
stated in the following corollary.

Corollary 1 to Lemma 1: Let sets $A$ and $B \setminus A$ be non-empty relative
complements where $A \subset B \subset \Omega$. Then for all $t > 0$

\[
f_A(t) - f_B(t) = \frac{\pi_{B \setminus A}(t)}{\pi_B(t)} \left[ f_A(t) - f_{B \setminus A}(t) \right].
\]


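Proof. Applying the lemma and noting that

\[
f_\Omega(t) = \pi_B(t) f_B(t) + \pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t)
\]

and substituting this into the left-hand side of (12) and expanding the right-hand
side yields

\[
f_A(t) - \left( \pi_B(t) f_B(t) + \pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t) \right) = \pi_{\Omega \setminus A}(t) f_A(t) - \pi_{\Omega \setminus A}(t) f_{\Omega \setminus A}(t) \tag{13}
\]

Adding $\pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t)$ and subtracting $\pi_{\Omega \setminus B}(t) f_B(t)$ on both sides of (13)
yields

\[
\begin{aligned}
f_A(t) - f_B(t) ={}& \pi_{\Omega \setminus A}(t) f_A(t) - \pi_{\Omega \setminus A}(t) f_{\Omega \setminus A}(t) \\
&+ \pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t) - \pi_{\Omega \setminus B}(t) f_B(t)
\end{aligned} \tag{14}
\]

Now, since $A \subset B$, then $\Omega \setminus A = (B \setminus A) \cup (\Omega \setminus B)$, a union of two disjoint sets.
Thus, from the consistency property described earlier (see Section 3.4.3),

\[
f_{\Omega \setminus A}(t) = \frac{\pi_{B \setminus A}(t) f_{B \setminus A}(t) + \pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t)}{\pi_{B \setminus A}(t) + \pi_{\Omega \setminus B}(t)} \tag{15}
\]

Note that $\pi_{B \setminus A}(t) + \pi_{\Omega \setminus B}(t) = \pi_{\Omega \setminus A}(t)$, hence (15) becomes

\[
\pi_{\Omega \setminus A}(t) f_{\Omega \setminus A}(t) = \pi_{B \setminus A}(t) f_{B \setminus A}(t) + \pi_{\Omega \setminus B}(t) f_{\Omega \setminus B}(t)
\]

and substituting this into (14) along with the expansion of $\pi_{\Omega \setminus A}(t)$ above and
simplifying yields

\[
f_A(t) - f_B(t) = \pi_{B \setminus A}(t) \left[ f_A(t) - f_{B \setminus A}(t) \right] + \pi_{\Omega \setminus B}(t) \left[ f_A(t) - f_B(t) \right]. \tag{16}
\]

Since $\pi_{\Omega \setminus B}(t) = 1 - \pi_B(t)$, upon further rearranging and simplifying
(16) the result follows. ■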
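
Rearranging (16) gives $f_A(t) - f_B(t) = \frac{\pi_{B \setminus A}(t)}{\pi_B(t)} [f_A(t) - f_{B \setminus A}(t)]$, the relative form of objective function complementarity. The sketch below checks this relation for every nested pair $A \subset B$ in a small example; the objective values, the temperature, and the helper names are hypothetical, and the coefficient is the one obtained from rearranging (16) rather than a quoted formula.

```python
import itertools
import math

def boltzmann(f, t):
    """Stationary probabilities from (1)."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def pi_agg(S, pi):
    """Definition (5)."""
    return sum(pi[i] for i in S)

def f_agg(S, pi, f):
    """Definition (6)."""
    return sum(pi[i] * f[i] for i in S) / pi_agg(S, pi)

f = [1.0, 2.0, 2.0, 4.0, 7.0]   # hypothetical objective function values
pi = boltzmann(f, 1.5)          # hypothetical temperature t = 1.5
n = len(f)

max_gap = 0.0
checked = 0
for b_size in range(2, n + 1):
    for B in itertools.combinations(range(n), b_size):
        for a_size in range(1, b_size):
            for A in itertools.combinations(B, a_size):
                rest = set(B) - set(A)   # B \ A, non-empty by construction
                lhs = f_agg(A, pi, f) - f_agg(B, pi, f)
                rhs = (pi_agg(rest, pi) / pi_agg(B, pi)
                       * (f_agg(A, pi, f) - f_agg(rest, pi, f)))
                max_gap = max(max_gap, abs(lhs - rhs))
                checked += 1
```

When `B` is the whole state space, `pi_agg(B, pi)` equals one and the check reduces to the identity of Lemma 1.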


Note that when $B = \Omega$ this corollary reduces to the statement in Lemma 1.
(Although this indicates that the corollary is a more general statement, it
better demonstrates consistency when stated as a corollary.)

Corollary 2 to Lemma 1: Given any non-empty complementary aggregate
states $A$ and $\Omega \setminus A$, $f_A(t) < f_{\Omega \setminus A}(t)$ if and only if

\[
f_A(t) < \langle f \rangle(t) < f_{\Omega \setminus A}(t).
\]

Proof. The implication leading to the statement $f_A(t) < f_{\Omega \setminus A}(t)$ is obvious.
Proving the other direction, if $f_A(t) < f_{\Omega \setminus A}(t)$ then from (12) it follows
that $f_A(t) < \langle f \rangle(t)$ since both sides of (12) must be negative. Switching the
sets $A$ and $\Omega \setminus A$ and applying the Lemma again, (12) becomes

\[
f_{\Omega \setminus A}(t) - \langle f \rangle(t) = f_{\Omega \setminus A}(t) - f_\Omega(t) = \pi_A(t) \left[ f_{\Omega \setminus A}(t) - f_A(t) \right]
\]

with both sides positive and the result follows. ■

Corollary 3 to Lemma 1: For any non-empty aggregate states $A$ and $\Omega \setminus A$,

\[
\frac{\partial \pi_A(t)}{\partial t} = -\frac{\partial \pi_{\Omega \setminus A}(t)}{\partial t}
\]

for all $t > 0$.

Proof. The result follows from the simple application of (11) and Lemma 1. ■

This  lemma  and  its  corollaries  show  that  by  virtue  of  scale  invariance,  a 
richer and more general set of relationships among objective function values
and stationary probabilities can be illuminated. Indeed, the significance
of  these  relationships  is  amplified  by  how  certain  aggregate  states  mirror 
the  globally  optimal  state.  The  following  section  establishes  an  important 
relationship  between  optimal  aggregate  states  and  other  states. 

3.5.  Optimal  Aggregate  States 

Define $S_0 \subset \Omega$ to be the set of optimal states. Therefore, any $i^* \in S_0$ has a
special characteristic, namely, its objective function value is strictly less than
all other objective function values for states not in $S_0$,

\[
f_{i^*} < f_i \quad \text{for all } i \notin S_0 \tag{17}
\]

Note that in SA this property of the global optima is supplemented by the
fact that $f_{i^*} < f_\Omega(t)$ at any temperature $t > 0$ (see the text associated
with (2)). Scale invariant relationships should therefore also be exhibited with
respect to a globally optimal aggregate state or supernode.

Defining a supernode is complicated by the fact that the objective function
value associated with an aggregate state is a function of temperature
$t$. Nonetheless, it is possible to draw analogies from the basic attributes of
an optimal state in order to identify reasonable requirements for defining a
supernode:

1.  A  supernode  should  have  the  same  properties  as  (17),  i.e.,  an  objective 
function  value  that  is  less  than  that  of  any  other  state  or  sets  of  states  not 
in  the  supernode; 

2.  an  objective  function  value  always  less  than  or  equal  to  the  expected 
objective  function  value. 

One approach for defining a supernode satisfying these requirements involves
ordering and indexing states according to their objective function values. This
requires that the states be aggregated based on their objective function values
(states with the same objective function value are thus aggregated together).
Define sets $S_0$ through $S_{p-1}$ as follows:

for all $i, j \in S_k$, $f_i = f_j$ and

\[
f_{S_0} < f_{S_1} < \cdots < f_{S_i} < f_{S_{i+1}} < \cdots < f_{S_{p-1}} \tag{18}
\]

for $p$ distinct objective function values, where $f_{S_k} = f_i$ for all $i \in S_k$.

Therefore, $S_i$ is the set of states with the $i$th best (after the optimal) objective
function value. A supernode, $\mathcal{S}_k$, is therefore defined as the aggregation
of the sets $S_0, \ldots, S_k$, i.e., all sets with the $k + 1$ lowest objective function values,

\[
\mathcal{S}_k = \bigcup_{i=0}^{k} S_i. \tag{19}
\]

The stationary probability of $\mathcal{S}_k$ can then be defined (using (5)) as the sum of
the stationary probabilities of the states within the supernode

\[
\pi_{\mathcal{S}_k}(t) = \sum_{i=0}^{k} \pi_{S_i}(t).
\]

The objective function value associated with supernode $\mathcal{S}_k$ can be defined
(using (6)) as

\[
f_{\mathcal{S}_k}(t) = \frac{\sum_{i=0}^{k} \pi_{S_i}(t) f_{S_i}(t)}{\sum_{i=0}^{k} \pi_{S_i}(t)}.
\]

Using these definitions, the supernode $\mathcal{S}_k$ has all the attributes and properties
of the globally optimal state: an index $k$, a stationary probability $\pi_{\mathcal{S}_k}(t)$,
and an objective function value $f_{\mathcal{S}_k}(t)$. Moreover, this supernode has analogous
relationships to other states in terms of these attributes and properties.
Note that from the ordering in (18), $f_{\mathcal{S}_k}(t)$ has a lower value than the objective
function value of any $S_i \not\subset \mathcal{S}_k$. Lemma 2 further shows that, like the optimal
states, $\mathcal{S}_k$ has an objective function value that is also less than the
expected objective function value for all $t > 0$.

LEMMA 2. The objective function value of a supernode as defined in (19)
is less than the expected objective function value for all $t > 0$. Thus, given
a state space with $p$ objective function values, for all $k < p - 1$ and all
temperatures $t > 0$, $f_{\mathcal{S}_k}(t) < \langle f \rangle(t)$.

Proof. Let $\mathcal{S}_k$ be a supernode and $\Omega \setminus \mathcal{S}_k$ be the corresponding complement
aggregate state. From the ordering of sets $S_i$ and the definition of $\mathcal{S}_k$, every
$i \in \mathcal{S}_k$ is such that $f_i < f_j$ for every $j \in \Omega \setminus \mathcal{S}_k$. Consequently, any convex
combination of objective function values of states in $\mathcal{S}_k$ is less than any convex
combination of objective function values of states in $\Omega \setminus \mathcal{S}_k$. Therefore,
for all $t > 0$, $f_{\mathcal{S}_k}(t) < f_{\Omega \setminus \mathcal{S}_k}(t)$. From the property of objective function
complementarity in Lemma 1, for all $t > 0$, $f_{\mathcal{S}_k}(t) < \langle f \rangle(t)$. ■

This  lemma  is  used  to  prove  the  following  theorem. 

THEOREM 1. The stationary probability of a supernode $\mathcal{S}_k$ monotonically
increases with decreases in the temperature parameter $t$ (i.e., for all
$t, \Delta t > 0$ with $\Delta t < t$, $\pi_{\mathcal{S}_k}(t - \Delta t) > \pi_{\mathcal{S}_k}(t)$).

Proof. From Lemma 2, $f_{\mathcal{S}_k}(t) < \langle f \rangle(t)$. Applying (11) where $\mathcal{S}_k$ is the
aggregate node, $\partial \pi_{\mathcal{S}_k}(t)/\partial t < 0$ for all $t$, which establishes the result. ■
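
Theorem 1 can be illustrated by evaluating the stationary probability of each supernode over a decreasing sequence of temperatures. This is a sketch under assumed data, not a proof: the objective values, the temperature schedule, and the helper names `boltzmann` and `supernode_probability` are hypothetical.

```python
import math

def boltzmann(f, t):
    """Stationary probabilities from (1)."""
    f_min = min(f)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(fi - f_min) / t) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

def supernode_probability(f, t, k):
    """Probability of the supernode built from the k + 1 lowest distinct values."""
    levels = sorted(set(f))   # the ordered distinct objective values, as in (18)
    cutoff = levels[k]
    pi = boltzmann(f, t)
    return sum(p for p, fi in zip(pi, f) if fi <= cutoff)

f = [1.0, 1.0, 2.0, 3.0, 3.0, 5.0, 8.0]    # hypothetical objective function values
temps = [5.0, 3.0, 2.0, 1.0, 0.5, 0.25]    # hypothetical decreasing temperatures
p_levels = len(set(f))                     # number of distinct objective values

# One probability curve per supernode index k < p - 1.
curves = {k: [supernode_probability(f, t, k) for t in temps]
          for k in range(p_levels - 1)}

# Each curve should rise strictly as the temperature falls.
monotone = all(later > earlier
               for curve in curves.values()
               for earlier, later in zip(curve, curve[1:]))
```

The index `k = p_levels - 1` is excluded because that supernode is the entire state space, whose probability is identically one.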

The scale invariance exhibited by Theorem 1 indicates an interesting relationship
among the states. Recall from (2) that states with objective function
values greater than $\langle f \rangle(t)$ have stationary probabilities that monotonically decrease
as $t$ decreases. Mitra et al. (1986, p.755-6) observed that non-optimal
states $i$ with $f_i < \langle f \rangle(t)$ have stationary probabilities that increase as the
temperature $t$ is decreased down to some critical value where $f_i = \langle f \rangle(t)$.
As $\langle f \rangle(t)$ continues to decrease with decreasing temperature $t$, $f_i > \langle f \rangle(t)$
and the stationary probabilities of these non-optimal states monotonically
decrease.

This behavior of increasing and decreasing stationary probabilities is also
exhibited within a supernode as the temperature is decreased. Observe that
a supernode $\mathcal{S}_k$ contains the non-optimal states in sets $S_1, \ldots, S_k$. This means
that the stationary probabilities of these non-optimal aggregate states increase
and then decrease as the temperature passes through some critical temperature.
Thus, the stationary probabilities of states within the supernode with objective
function values less than $f_\Omega(t)$ increase while those with objective function
values greater than $f_\Omega(t)$ decrease. Yet from Theorem 1, the stationary probability of a supernode
monotonically increases. This must therefore indicate that the states within
the supernode with increasing stationary probabilities more than offset those
states within the supernode with decreasing stationary probabilities.

[Figure: aggregate states $S_0, S_1, \ldots, S_i, S_{i+1}, \ldots, S_k$, which together form supernode $\mathcal{S}_k$, arranged in order of increasing objective function value, with $\langle f \rangle(t)$ marked on the same axis.]

Figure 1. Aggregate States Ranked by Objective Function Value

To see this, consider Figure 1 where aggregate states $S_0, \ldots, S_k$ form
supernode $\mathcal{S}_k$ and where $f_{S_i} < \langle f\rangle(t) < f_{S_{i+1}}$
for states $S_i$ and $S_{i+1}$ both contained within the supernode (thus,
$i + 1 \le k$). In this case, for states $S_j$ with $j \le i$,
$d\pi_{S_j}(t)/dt < 0$, hence these aggregate states have increasing
stationary probabilities in accordance with (2) and Theorem 1. However,
aggregate states $S_j \subset \mathcal{S}_k$ with $j > i$ have
$d\pi_{S_j}(t)/dt > 0$, hence have monotonically decreasing stationary
probabilities. But from Theorem 1, the entire supernode has
$d\pi_{\mathcal{S}_k}(t)/dt < 0$. From this, and based on (5),

$$\frac{d\pi_{\mathcal{S}_k}(t)}{dt} = \frac{d\pi_{\mathcal{S}_i}(t)}{dt} + \frac{d\pi_{\mathcal{S}_k \setminus \mathcal{S}_i}(t)}{dt} < 0$$

hence

$$\frac{d\pi_{\mathcal{S}_i}(t)}{dt} < -\frac{d\pi_{\mathcal{S}_k \setminus \mathcal{S}_i}(t)}{dt}$$

and the magnitude of the rate of increasing probability of $\mathcal{S}_i$ is
greater than the magnitude of the rate of decreasing probability of state
$\mathcal{S}_k \setminus \mathcal{S}_i$. The relationships between these rates
suggest that any aggregate state that contains a supernode will have a
monotonically increasing stationary probability. This point motivates the
following discussion and theorem.
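The offsetting of rates inside a supernode can be checked directly from the rate-of-change form $d\pi_A(t)/dt = (\pi_A(t)/t^2)[f_A(t) - f_\Omega(t)]$. The sketch below is illustrative only: it assumes a hypothetical five-state problem in which each of the three lowest states is treated as its own aggregate state, and the temperature $t = 1$ is chosen so that $\langle f\rangle(t)$ lies between the two lowest objective function values.

```python
import math

def boltzmann(f, t):
    """Stationary probabilities proportional to exp(-f_i / t)."""
    weights = [math.exp(-fi / t) for fi in f]
    z = sum(weights)
    return [w / z for w in weights]

def dpi_dt(f, t, members):
    """Rate of change (pi_A / t^2) (f_A - f_Omega) for lumped node A = members."""
    pi = boltzmann(f, t)
    pi_A = sum(pi[i] for i in members)
    f_A = sum(pi[i] * f[i] for i in members) / pi_A   # conditional expectation
    f_Omega = sum(p * fi for p, fi in zip(pi, f))
    return pi_A / t ** 2 * (f_A - f_Omega)

f = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical objective function values
t = 1.0                          # here f_0 < <f>(t) < f_1

rates = {i: dpi_dt(f, t, [i]) for i in (0, 1, 2)}   # individual states
supernode_rate = dpi_dt(f, t, [0, 1, 2])            # the three lowest states lumped

print(rates)
print(supernode_rate)
```

State 0 has a negative rate (its probability grows as $t$ decreases), states 1 and 2 have positive rates, and the lumped rate is the sum of the three and remains negative.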

Theorem 1 is based on the objective function value ordering in (18) and
the definition of a supernode, i.e., all objective function values of states
contained in the supernode are strictly less than objective function values for
all states not in the supernode. Although this ordering preserves the properties
of the optimal states, it is also somewhat restrictive. The following theorem
generalizes Theorem 1 by only requiring that the aggregate state contain all
the optimal states, i.e., it is also allowed to contain other states.

Define a partial supernode $\mathcal{A} = \mathcal{S}_k \cup A$ where $A$ is a
set of non-optimal states and $\mathcal{S}_k$ and $A$ are separated, i.e., there
exist some intervening states with objective function values greater than the
maximum objective function value in $\mathcal{S}_k$ and less than the minimum
objective function value in $A$ (see Figure 2 for an illustration).

THEOREM 2. Given a supernode $\mathcal{S}_m \subset \Omega$ and a partial
supernode $\mathcal{A} \subset \mathcal{S}_m$, where the objective function
values over the entire state space are non-negative ($f_i \ge 0$ for all
$i \in \Omega$), there exists a temperature $t'$ where
$f_\Omega(t') < \min\{f_i : i \in \mathcal{S}_m \setminus \mathcal{A}\}$ such
that the stationary probability of the partial supernode monotonically
increases for decreasing $0 < t < t'$, i.e., $d\pi_{\mathcal{A}}(t)/dt < 0$.

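Proof. To clarify the proof, let $B = \mathcal{S}_m \setminus \mathcal{A}$, the
relative complement of $\mathcal{A}$ with respect to supernode $\mathcal{S}_m$.
Since temperature $t'$ is such that $f_\Omega(t') < \min\{f_i : i \in B\}$, it
follows that for all $0 < t < t'$, $f_\Omega(t) < f_B(t)$. But

$$f_B(t) - f_\Omega(t) = f_B(t) - f_{\mathcal{S}_m}(t) + f_{\mathcal{S}_m}(t) - f_\Omega(t) > 0 \qquad (20)$$

Re-writing (20) using Lemma 1

$$\frac{\pi_{\mathcal{A}}(t)}{\pi_{\mathcal{S}_m}(t)}\left[f_B(t) - f_{\mathcal{A}}(t)\right] + \pi_{\Omega\setminus\mathcal{S}_m}(t)\left[f_{\mathcal{S}_m}(t) - f_{\Omega\setminus\mathcal{S}_m}(t)\right] > 0 \qquad (21)$$

where $\mathcal{A} = \mathcal{S}_m \setminus B$.

Now note that the second term in (21) is always negative (from Lemmas 1
and 2), hence the first term must be positive. Consequently, reversing $f_B(t)$
and $f_{\mathcal{A}}(t)$ in (21), weighting by the positive factor
$\pi_B(t)/\pi_{\mathcal{S}_m}(t)$, and adding the second term yields

$$\frac{\pi_B(t)}{\pi_{\mathcal{S}_m}(t)}\left[f_{\mathcal{A}}(t) - f_B(t)\right] + \pi_{\Omega\setminus\mathcal{S}_m}(t)\left[f_{\mathcal{S}_m}(t) - f_{\Omega\setminus\mathcal{S}_m}(t)\right] < 0 \qquad (22)$$

Applying Lemma 1 and this time noting that $B = \mathcal{S}_m \setminus \mathcal{A}$,
the terms in (22) become

$$\left[f_{\mathcal{A}}(t) - f_{\mathcal{S}_m}(t)\right] + \left[f_{\mathcal{S}_m}(t) - f_\Omega(t)\right] < 0$$

hence, for all $0 < t < t'$, $f_{\mathcal{A}}(t) - f_\Omega(t) < 0$ and therefore
from (11), $d\pi_{\mathcal{A}}(t)/dt < 0$ and the stationary probability of the
partial supernode monotonically increases. ■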
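Theorem 2 can be probed numerically on a toy instance. The sketch below assumes a hypothetical six-state problem (objective function values 0, 1, 2, 3, 10, 11), a supernode made of the five lowest states, and a partial supernode made of the optimal state together with the separated state of value 10. The bisection used to locate a suitable $t'$ is an illustrative device, not part of the proof.

```python
import math

def lumped(f, t, members):
    """Return (pi_A, f_A) for the lumped node A = members at temperature t."""
    w = [math.exp(-fi / t) for fi in f]
    w_A = sum(w[i] for i in members)
    f_A = sum(w[i] * f[i] for i in members) / w_A
    return w_A / sum(w), f_A

def dpi_dt(f, t, members):
    """Rate of change (11): (pi_A / t^2) (f_A - f_Omega)."""
    pi_A, f_A = lumped(f, t, members)
    _, f_Omega = lumped(f, t, range(len(f)))
    return pi_A / t ** 2 * (f_A - f_Omega)

f = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]   # hypothetical, non-negative values
supernode = [0, 1, 2, 3, 4]            # five lowest states
partial = [0, 4]                       # optimal state plus a separated state
B = [i for i in supernode if i not in partial]
bound = min(f[i] for i in B)

# Bisection for a temperature t' with f_Omega(t') just below the bound.
lo, hi = 1e-3, 100.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if lumped(f, mid, range(len(f)))[1] < bound:
        lo = mid
    else:
        hi = mid
t_prime = lo

derivs = [dpi_dt(f, t_prime * k / 50, partial) for k in range(1, 51)]
print(t_prime, max(derivs), dpi_dt(f, 100.0, partial))
```

In this instance the rate for the partial supernode is positive at the high temperature $t = 100$ but negative at every sampled temperature up to $t'$, consistent with monotonic increase of its stationary probability only below $t'$.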

The monotonic behavior of the objective function values for the supernode
and the partial supernode, as well as the form of the rate of change of the
stationary probability, demonstrate scale invariance in the SA algorithm.

Figure 2. Partial Supernode with arbitrary and separated aggregate state A

3.6.  Scale  Invariance  in  Second  Moments 

The scale invariance described so far involved only the first moments of
objective function values, that is, expressions and formulations with $f$. This
section presents results on scale invariance involving second moments with
terms containing $f^2$. Once again, a basis for comparison is needed. One
useful relationship is the derivative of the expected objective function value
with respect to temperature:

$$\frac{d\langle f\rangle(t)}{dt} = \frac{df_\Omega(t)}{dt} = \frac{\langle f^2\rangle(t) - [\langle f\rangle(t)]^2}{t^2} = \frac{\sigma^2_\Omega(t)}{t^2} \qquad (23)$$

(Aarts and Korst, 1989, p.20) where the second moment of $f$ is defined by

$$\langle f^2\rangle(t) = \sum_{i \in \Omega} \pi_i(t) f_i^2,$$

and $\sigma^2_\Omega(t)$ is the variance of objective function values over the
entire state space at temperature $t$.

As  noted  earlier,  scale  invariance  requires  showing  that  reasonable  defi¬ 
nitions  of  certain  quantities  for  aggregated  states  have  similar  relationships 
to  other  quantities  as  do  the  analogous  quantities  for  individual  states.  To 
that  end  and  using  the  same  approach  and  justifications  as  in  (6),  define  the 
variance  of  the  objective  function  of  a  lumped  node  A  by  the  conditional 
variance 

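$$\begin{aligned}
\sigma^2_A(t) &= E\{(F - f_A(t))^2 \mid 1_A(t) = 1\} \\
&= \sum_{i \in \Omega} [f_i - f_A(t)]^2 \Pr\{F = f_i \mid 1_A(t) = 1\} \\
&= \sum_{i \in \Omega} \frac{[f_i - f_A(t)]^2 \Pr\{F = f_i \wedge 1_A(t) = 1\}}{\Pr\{1_A(t) = 1\}} \\
&= \sum_{i \in A} \frac{\left[f_i^2 - 2 f_i f_A(t) + [f_A(t)]^2\right] \pi_i(t)}{\pi_A(t)} \\
&= \frac{\sum_{i \in A} \pi_i(t) f_i^2}{\pi_A(t)} - \frac{\sum_{i \in A} \pi_i(t) [f_A(t)]^2}{\pi_A(t)} \\
&= \langle f^2\rangle_A(t) - [f_A(t)]^2
\end{aligned} \qquad (24)$$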


where the second moment of the objective function of lumped node $A$ at
temperature $t$ is indicated by $\langle f^2\rangle_A(t)$.

Now that a suitable definition for the variance of a lumped node has been
given, scale invariance can be seen in the following relationship, which is
similar to (23):


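$$\frac{df_A(t)}{dt} = \frac{d}{dt}\left[\frac{\sum_{i \in A} \pi_i(t) f_i}{\pi_A(t)}\right] = \sum_{i \in A} \frac{\partial}{\partial t}\left(\frac{\pi_i(t) f_i}{\pi_A(t)}\right) \qquad (25)$$

Taking derivatives in (25) leads to

$$\frac{df_A(t)}{dt} = \sum_{i \in A} \frac{\pi_A(t) f_i \frac{d}{dt}\pi_i(t) - \pi_i(t) f_i \frac{d}{dt}\pi_A(t)}{\pi_A^2(t)} \qquad (26)$$

Recall that $\frac{d\pi_i(t)}{dt} = \frac{\pi_i(t)}{t^2}[f_i - f_\Omega(t)]$ and
$\frac{d\pi_A(t)}{dt} = \frac{\pi_A(t)}{t^2}[f_A(t) - f_\Omega(t)]$.
Substituting these expressions into (26) and simplifying leads to

$$\begin{aligned}
\frac{df_A(t)}{dt} &= \frac{1}{t^2} \sum_{i \in A} \frac{\pi_i(t) f_i}{\pi_A(t)} [f_i - f_A(t)] \\
&= \frac{1}{t^2} \sum_{i \in A} \frac{\pi_i(t) f_i^2}{\pi_A(t)} - \frac{1}{t^2} \sum_{i \in A} \frac{\pi_i(t) f_i f_A(t)}{\pi_A(t)} \\
&= \frac{1}{t^2} \left[\langle f^2\rangle_A(t) - f_A^2(t)\right] \\
&= \frac{\sigma^2_A(t)}{t^2}
\end{aligned} \qquad (27)$$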

with scale invariance shown in the correspondence between equations (23)
and (27); scale invariance thus holds for second moments. The next section
describes how scale invariance in SA can be used in a variety of ways that, in
some instances, are unavailable in other optimization schemes.
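The correspondence between the whole-space relationship and its lumped-node counterpart can be checked by comparing a finite-difference derivative of $f_A(t)$ with $\sigma^2_A(t)/t^2$. The sketch below is illustrative: the objective function values, the lumped node `A`, the temperature and the step size `h` are all hypothetical choices. It uses the fact that $\pi_i(t)/\pi_A(t)$ is itself a Boltzmann distribution restricted to $A$.

```python
import math

def lumped_stats(f, t, members):
    """Return (f_A, var_A): conditional mean and variance of f on A at t."""
    # pi_i / pi_A equals exp(-f_i / t) normalized over the members of A.
    w = [math.exp(-f[i] / t) for i in members]
    z_A = sum(w)
    f_A = sum(wi * f[i] for wi, i in zip(w, members)) / z_A
    second_moment = sum(wi * f[i] ** 2 for wi, i in zip(w, members)) / z_A
    return f_A, second_moment - f_A ** 2

f = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical objective function values
A = [1, 2, 4]                    # an arbitrary lumped node
everything = list(range(len(f)))
t, h = 1.5, 1e-5

def central_difference(members):
    up = lumped_stats(f, t + h, members)[0]
    down = lumped_stats(f, t - h, members)[0]
    return (up - down) / (2 * h)

numeric_A = central_difference(A)
exact_A = lumped_stats(f, t, A)[1] / t ** 2
numeric_all = central_difference(everything)
exact_all = lumped_stats(f, t, everything)[1] / t ** 2

print(numeric_A, exact_A)
print(numeric_all, exact_all)
```

Both pairs agree to within the finite-difference error, for the arbitrary lumped node and for the entire state space alike.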

4.  Applications

One  of  the  hallmarks  of  SA  that  makes  it  especially  attractive  is  its  generality. 
Indeed,  it  is  SA’s  very  foundation  in  thermodynamics  that  permits  it  to  be  used 
as  a  metaheuristic — an  optimization  scheme  that  can  be  applied  to  numerous 
optimization  problems.  It  is  also  this  foundation  that  gives  rise  to  its  scale 
invariance  by  virtue  of  the  exponential  form  of  the  Boltzmann  Distribution. 
It  is  this  universality  that  provides  clues  as  to  the  benefits  of  SA’s  scale 
invariance — benefits  that  provide  numerous  avenues  for  the  development  of 
new  methodologies  that  take  advantage  of  SA’s  scale  invariance. 

The  fundamental  utility  of  scale  invariance  as  a  property  of  some  under¬ 
lying  phenomenon  is  that  it  permits  inferences  to  easily  be  made  on  different 
scales  based  on  some  observed  phenomenon  or  mathematical  characteristics 
associated  with  some  given  scale.  It  can  therefore  enable  or  facilitate  the 
analysis  of  a  phenomenon  at  different  scales  or  when  information  is  available 
only  for  contrived  situations.  In  addition,  if  statistically  based  inferences  are
possible  at  one  scale,  which  is  certainly  the  case  with  SA,  scale  invariance 
can  enable  or  facilitate  statistical  inference  on  other  scales.  Since  SA  is  used 
as  an  optimization  scheme  for  many  different  types  of  problems,  it  is  not 
surprising  that  its  scale  invariance  properties  offer  some  advantages  over 
other  optimization  schemes.  Scale  invariance  properties  in  SA  can  therefore 
provide  tools  that  facilitate  the  solution  of  both  theoretical  and  practical  prob¬ 
lems  in  analysis,  statistical  inference  and  optimization.  This  section  provides 
examples  that  explore  and  highlight  these  three  potential  application  areas  of 
SA’s  scale  invariance  properties  and  offers  directions  for  future  research  in 
the  development  of  new  experimental  and  computational  methodologies. 

Section  4.1  describes  an  example  where  SA’s  scale  invariance  was  used
in  analysis  to  extend  results  of  a  contrived  situation  to  a  more  general  situ¬ 
ation.  Section  4.2  describes  how  scale  invariance  in  SA  offers  new  avenues 
for  performing  statistical  inference  with  SA  by  showing  how  it  is  possible 
to  define  confidence  intervals  for  the  value  of  specified  decision  variables  in 
the  optimal  solution  without  necessarily  converging  to  the  optimum.  Finally, 
Section  4.3  extends  the  ideas  in  Section  4.2  and  describes  a  type  of  branch 
and  probability  bound  algorithm  based  on  scale  invariance  that  may  improve 
the  finite-time  efficiency  of  SA. 

4.1.  Examples  of  SA’s  Scale  Invariance  in  Analysis

Scale invariance often provides an attractive angle of attack in the analysis
of problems. Indeed, this was the motivating factor in creating and exploring
SA’s scale invariance property in (Fleischer, 1993). (Fleischer, 1993) obtained
an analytical result for the contrived situation where the expected objective
function at temperature $t$, $f_\Omega(t)$, was such that
$f_{S_0} < f_\Omega(t) < f_{S_1}$, i.e., between the least cost and
next-to-least cost objective function values. Such a situation arises in
practical circumstances only after SA has almost converged.

The analysis to which scale invariance was applied involved the transitions
between the two sets of states, $S_0$ and $S_1$, and the distribution of
typical annealing sequences (Goldie and Pinch, 1991), where the term typical is
a reference to the Asymptotic Equipartition Property from information theory.
This contrived situation made the analysis more tractable since it was easier
to define “typicality” (see (Fleischer and Jacobson, 1999; Fleischer, 1993)).

Extending this result to the more general circumstance where, for some $k$,
$f_{S_k} < f_\Omega(t) < f_{S_{k+1}}$ required a new definition of an optimal
solution so that the analysis of this more general situation could proceed in
an analogous manner as in the contrived situation. By lumping the states with
the lowest $k$ objective function values together, thereby creating a supernode
$\mathcal{S}_k$ with all of the same properties as the optimal solution, the
subsequent analysis was possible and made much easier due to SA’s scale
invariance property. Other, as yet unknown, problems in analysis may very well
be facilitated by using the scale invariance property of SA.

4.2.  Statistical  Inference  by  Simulated  Annealing 

The  scale  invariance  of  second  moments  and  variances  described  above  suggests
applications  in  the  realm  of  statistical  inference.  This  application,  referred
to  as  statistical  inference  by  simulated  annealing  (SISA),  constitutes  a
new  way  to  perform  statistical  inference  using  SA.  Since  lumped  states  can 
be  defined  by  appropriate  constraints  on  decision  variables,  new  methodolo¬ 
gies  and  new  analytical  and  experimental  tools  become  available  to  assess 
quantities  associated  with  the  lumped  states.  One  potential  value  of  this  is 
evident: 

If  the  partitioning  of  the  solution  space  is  effected  by  putting  constraints 
on  a  specified  variable,  the  value  of  that  variable  in  the  optimal  solution 
can  be  determined  with  a  specified  level  of  confidence. 

This  permits  values  of  specified  decision  variables  to  be  statistically  ascer¬ 
tained  without  necessarily  obtaining  a  good  estimate  of  the  entire  ensemble 
of  decision  variables.  This  possibility  constitutes  a  feature  of  SA  that  appears 
to  be  unique  among  metaheuristics. 

This  section  briefly  describes  the  potential  avenues  for  statistical  infer¬ 
ence  based  on  SA’s  scale  invariance  using  basic  ideas  and  concepts.  The  goal 
here  is  to  introduce  some  of  the  basic  aspects  of  SISA  and  how  various  test 
statistics  can  be  engineered  to  take  advantage  of  SA’s  scale  invariance.  Issues
regarding  sampling,  the  use  of  ratio  estimates,  the  convergence  of  these  es¬ 
timates,  and  the  exact  distribution  of  the  relevant  random  variables  and  test 
statistics may therefore become active areas of future research.

It is worth noting that SISA is intimately connected to the mathematics
of Markov chains. Recall that SA itself is embodied by a Markov chain.
As such, the statistical methodologies associated with Markov chain Monte
Carlo (MCMC) techniques are applicable (see (Norris, 1997)). These techniques
require that the SA algorithm is executed at a fixed temperature and the
various objective function values produced at each iteration recorded. These
objective function values provide the raw data needed to calculate various test
statistics.

The basic idea behind SISA is to partition the domain space of a problem
into mutually exclusive subsets, say $A$ and $\Omega \setminus A$, and to make
inferences as to which subset contains the global optimum. This is done by
computing estimates of $f_\Omega(t)$, $f_A(t)$, $f_{\Omega\setminus A}(t)$,
$\sigma^2_\Omega(t)$, and $\sigma^2_A(t)$, and testing whether
$f_A(t) - f_{\Omega\setminus A}(t) = 0$. From the property of Objective
Function Complementarity (see Lemma 1 and its corollaries), if
$f_A(t) - f_{\Omega\setminus A}(t) < 0$ $(> 0)$, then it is possible to infer
that $A$ $(\Omega \setminus A)$ contains the global optimum.

Consider the following illustrative example: Let $P$ be a decision problem
with an $n$ vector of decision variables
$x = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, where $x_k \in \{0, 1\}$ for all
$k$. Partition the domain space into two mutually exclusive and exhaustive sets
$A = \{x \in \Omega : x_k = 0\}$ and
$\Omega \setminus A = \{x \in \Omega : x_k = 1\}$ for some specified $k$.
Therefore, sets $A$ and $\Omega \setminus A$ constitute two complementary
lumped states.

SISA is performed to ascertain the value of the specified decision variable
$x_k$. Thus, SA (MCMC) experiments are executed at some fixed temperature $t$
for $m$ iterations and output analysis is done in the standard way (see
e.g., (Hobert, 2001) for recent work on MCMC). Each such experiment produces
two stochastically generated sequences of objective function values and
solution states of length $m$. For the $i$th experimental replication of SA
at temperature $t$

$$\{f_{i,a+j}\}_{j=1}^{m} = \{f_{i,a+1}, f_{i,a+2}, \ldots, f_{i,a+m}\}$$

constitutes the sequence of objective function values and

$$\{x_{i,a+j}\}_{j=1}^{m} = \{x_{i,a+1}, x_{i,a+2}, \ldots, x_{i,a+m}\} \qquad (28)$$

constitutes the sequence of corresponding solution states, where $a$ is some
index count sufficiently high so as to ensure that the simulation achieves
steady-state.
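A fixed-temperature experiment of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the separable objective function `objective`, the single-bit-flip neighborhood, the Metropolis acceptance rule and the all-zeros starting state are assumptions made for the example. The function records the objective function values and states for iterations $a+1$ through $a+m$.

```python
import math
import random

COSTS = [2.0, 1.0, 0.5, 1.5]   # hypothetical cost of setting each x_k to 1

def objective(x):
    """Hypothetical separable objective: total cost of the variables set to 1."""
    return sum(c * xi for c, xi in zip(COSTS, x))

def sa_fixed_temperature(objective, n, t, a, m, rng):
    """Run SA at fixed temperature t; discard a iterations, record the next m."""
    x = [0] * n                      # same initial condition for every replication
    fx = objective(x)
    f_seq, x_seq = [], []
    for step in range(a + m):
        k = rng.randrange(n)         # propose flipping one decision variable
        y = x[:]
        y[k] = 1 - y[k]
        fy = objective(y)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(-(fy - fx) / t).
        if fy <= fx or rng.random() < math.exp(-(fy - fx) / t):
            x, fx = y, fy
        if step >= a:
            f_seq.append(fx)
            x_seq.append(tuple(x))
    return f_seq, x_seq

m = 500
f_seq, x_seq = sa_fixed_temperature(objective, n=len(COSTS), t=1.0,
                                    a=200, m=m, rng=random.Random(0))
print(len(f_seq), sum(f_seq) / m)
```

Independent replications would call the function again with differently seeded generators, giving one row of realizations per replication.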

A somewhat less efficient though simpler experimental design to analyze
is to run $i = 1, \ldots, r$ independent replications of SA with the same
initial conditions and different sequences of pseudo-random variables. This
produces the following realizations of objective function values (the subscript
$a$ is ignored to simplify the expressions):

$$\begin{array}{ccccc}
f_{11}, & \ldots, & f_{1j}, & \ldots, & f_{1m} \\
f_{21}, & \ldots, & f_{2j}, & \ldots, & f_{2m} \\
\vdots & & \vdots & & \vdots \\
f_{r1}, & \ldots, & f_{rj}, & \ldots, & f_{rm}
\end{array} \qquad (29)$$


Each column of (29) represents a set of i.i.d. realizations of objective
function values (see (Law and Kelton, 1991, Ch. 9)) approximately distributed
according to the Boltzmann Distribution at temperature $t$ (to avoid cumbersome
notation, the temperature parameter and the reference to the $j$th column are
dropped, where it is understood that the realizations of random variables are
i.i.d. with the simulation executed at temperature $t$). Define the following
random variables based on the random objective function value $F_{ij}$
generated in the $j$th iteration of experiment $i$ as

$$F_A = \frac{\sum_{i=1}^{r} F_{ij}\, 1(i \in A)}{\sum_{i=1}^{r} 1(i \in A)}, \qquad F_{\Omega\setminus A} = \frac{\sum_{i=1}^{r} F_{ij}\, 1(i \in \Omega \setminus A)}{\sum_{i=1}^{r} 1(i \in \Omega \setminus A)}$$

where $1(i \in A)$ indicates that the state generated in experiment $i$ lies
in $A$, and

$$D = F_A - F_{\Omega\setminus A}.$$

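This naturally leads to the following definitions, for a given column $j$, for
the test statistics:

$$\hat{f}_A = \frac{\sum_{i=1}^{r} f_{ij}\, 1_A}{\sum_{i=1}^{r} 1_A}, \qquad \hat{f}_{\Omega\setminus A} = \frac{\sum_{i=1}^{r} f_{ij}\, 1_{\Omega\setminus A}}{\sum_{i=1}^{r} 1_{\Omega\setminus A}}, \qquad \hat{f}_\Omega = \frac{\sum_{i=1}^{r} f_{ij}}{r} \qquad (30)$$

as estimators of $f_A$, $f_{\Omega\setminus A}$ and $f_\Omega$, respectively.
Note that in the denominators in (30) and for any column $j$,
$\sum_{i=1}^{r} 1_A = r - \sum_{i=1}^{r} 1_{\Omega\setminus A}$. Using the
estimates in (30), define

$$\hat{D} = \hat{f}_A - \hat{f}_{\Omega\setminus A}$$

as an estimate of $d = f_A - f_{\Omega\setminus A}$ with three degrees of
freedom (it can be shown via the Central Limit Theorem that
$\hat{D} \xrightarrow{d} N(f_A - f_{\Omega\setminus A}, \sigma^2_D)$. Further
analysis will shed light on the exact distribution function for finite-time
executions of SA).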

Using these estimates, various forms of statistical inference are possible.
One approach is to test the following null hypothesis against the alternative
hypothesis:

$H_0$: $d = 0$

$H_A$: $d < 0$ or $d > 0$

Should the test statistics result in the decision to reject $H_0$ then,
depending on whether $\hat{D} < 0$ or $\hat{D} > 0$, one could infer that
$d < 0$ or $d > 0$ and, from the corollary to Lemma 1, that
$f_A < (>)\, f_{\Omega\setminus A}$. One can then conclude that $A$ or
$\Omega \setminus A$ contains the global optimum and hence that $x_k = 0$ $(1)$
in the optimal solution. This also suggests that computing a confidence
interval on the value of $d$ can also be useful in deciding whether $x_k = 0$
or $1$, provided the interval does not contain 0. The foregoing also suggests
how a confidence measure can be assigned to the entire ensemble of decision
variables.
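The test on a single decision variable can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the separable objective function, the exact i.i.d. Boltzmann sampler (feasible only because the state space has 16 states, and standing in for one column of replicated SA output), and the normal-approximation interval with a two-sample standard error, which is not a procedure taken from the paper.

```python
import itertools
import math
import random
import statistics

COSTS = [2.0, 1.0, 0.5, 1.5]   # hypothetical cost of setting each x_k to 1

def objective(x):
    return sum(c * xi for c, xi in zip(COSTS, x))

def sample_boltzmann(t, r, rng):
    """Draw r i.i.d. states from the exact Boltzmann distribution (toy sizes only)."""
    states = list(itertools.product([0, 1], repeat=len(COSTS)))
    weights = [math.exp(-objective(x) / t) for x in states]
    return rng.choices(states, weights=weights, k=r)

def sisa_interval(states, k, z=1.96):
    """Estimate d = f_A - f_(Omega minus A) for A = {x : x_k = 0} with an interval."""
    f_A = [objective(x) for x in states if x[k] == 0]
    f_B = [objective(x) for x in states if x[k] == 1]
    d_hat = statistics.mean(f_A) - statistics.mean(f_B)
    # Assumed standard error: independent-groups normal approximation.
    se = math.sqrt(statistics.variance(f_A) / len(f_A)
                   + statistics.variance(f_B) / len(f_B))
    return d_hat, (d_hat - z * se, d_hat + z * se)

states = sample_boltzmann(t=1.0, r=2000, rng=random.Random(1))
d_hat, (lo, hi) = sisa_interval(states, k=0)

if hi < 0:
    decision = 0        # infer x_k = 0 in the optimal solution
elif lo > 0:
    decision = 1        # infer x_k = 1 in the optimal solution
else:
    decision = None     # interval contains 0: no inference
print(d_hat, (lo, hi), decision)
```

For this separable objective the true difference is $d = -2$ for $k = 0$, so the interval lies well below zero and the sketch infers $x_0 = 0$, which matches the optimal all-zeros solution.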

One  of  the  implications  of  SISA  is  that  confidence  intervals  for  each  one 
of  the  n  decision  variables  can  be  obtained  by  using  the  data  in  (29)  and 
appropriate  definitions  of  lumped  states.  This  provides  an  approach  to  opti¬ 
mization  where  one  can  assign  a  confidence  level  to  each  decision  variable 
in  a  putative  solution — something  not  readily  available  in  other  optimization 
schemes.  Furthermore,  it  may  be  possible  to  improve  on  this  approach  and 
the  efficiency  of  running  SA  by  modifying  the  search  algorithm  to  search 
only  the  partition  deemed  to  have  a  high  probability  (high  confidence  level) 
of  containing  the  optimal  solution.  This  idea  is  the  subject  of  the  next  section. 

4.3.  Partitioning  Algorithms 

A  practical  application  of  the  scale  invariance  in  SA  is  in  the  design  of  a 
partition,  or  branch  and  probability  bound  algorithm.  See  (Shi  and  Olafsson,
2000)  for  a  description  of  a  type  of  partition  algorithm  based  on  so-called 
nested  partitions.  Such  a  partition  algorithm  would  use  SISA  on  each  par¬ 
tition  to  identify  those  subsets  containing  states  with  certain  characteristics 
(such  as  the  optimal  objective  function  value;  see  e.g.,  (Pinter,  1996)).  For 
these  types  of  algorithm,  the  search  is  continued  using  the  remaining  subsets 
as  a  new  domain  space.  This  process  continues  and  sequentially  shrinks  the 
search  space  in  the  hopes  that  it  provides  a  more  efficient  search. 

Because  the  decision  rule  for  excluding  a  subset  is  probabilistic,  such 
a  partition  algorithm  is  also  a  type  of  branch  and  probability  bound  algo¬ 
rithm  in  which  a  subset  containing  some  desirable  feature  is  identified  with  a 
high  probability  based  on  some  prospectiveness  criterion  (Zhigljavsky,  1991, 
p.  147).  The  scale  invariance  in  SA  readily  lends  itself  to  both  partition  and 
branch  and  probability  bound  type  algorithms  and  the  development  of  novel 
prospectiveness  criteria. 

Whereas  the  example  of  SISA  above  employed  partitioning  the  state  space 
into  two  mutually  exclusive  subsets  to  determine  the  value  of  a  single,  specified
decision  variable,  it  is  also  possible,  and  perhaps  desirable,  to  partition
the  domain  space  into  a  larger  number  of  mutually  exclusive  subsets.  Once 
this  is  done,  the  subset  deemed  least  likely  to  contain  the  optimum  is  then 
excluded  from  further  search.  This  provides  a  more  conservative  decision 
rule  for  shrinking  the  domain  space  and  hence  lowers  the  probability  of  Type 
I  errors — i.e.,  the  probability  of  excluding  portions  of  the  domain  space  con¬ 
taining  the  global  optimum.  A  natural  question  to  then  ask  is  how  to  rank  the 
various partitions.

One way of ranking the partitions is based on the scale invariance exhibited
by (11), which is a measure of the rate of change of the stationary
probability with respect to temperature. An estimate of (11) for each partition
provides more information than the value of $D = f_A - f_{\Omega\setminus A}$
since (11) weights the quantity $D$ by the quantity $\pi_A/t^2$.

Using  these  ideas,  a  recursive  partitioning  algorithm  can  be  designed  in 
which  partitioning  is  repeated  at  successive  iterations  to  shrink  the  state  space. 
Consider  a  partition  of  a  discrete  optimization  problem  defined  by  the  sets 
$A_1, A_2, \ldots, A_m$  based  on  some  scheme  that  may  take  advantage  of  some
underlying  structure  (although  the  partitioning  can  in  fact  be  completely  ar¬ 
bitrary).  Recall  from  Theorem  2  that  an  aggregate  state  containing  an  optimal 
state  has  a  stationary  probability  that  monotonically  increases  for  sufficiently 
low  temperatures  (how  low  this  temperature  must  be  for  monotonicity  is  in¬ 
dicated  in  the  theorem  statement).  Using  a  fixed  temperature  t,  an  estimate  of 
the  derivative  of  the  stationary  probability  of  each  partition  can  be  obtained 
from  estimates  of  each  component  of  (11).  Thus,  given  an  SA  experiment 
producing  a  sequence  of  states  (see  (28)),  define  the  estimator 


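$$\widehat{\frac{d\pi_{A_k}}{dt}} = \frac{\hat{\pi}_{A_k}}{t^2}\left(\hat{f}_{A_k} - \hat{f}_\Omega\right) \qquad (31)$$

where the $\hat{f}_{A_k}$ and $\hat{f}_\Omega$ are defined as in (30) and

$$\hat{\pi}_{A_k} = \frac{\sum_{j=1}^{m} 1_{A_k}}{m} \qquad (32)$$

is an estimator for $\pi_{A_k}(t)$. The estimate in (31) is used as the
prospectiveness criterion in the branch and probability bound nomenclature
(Zhigljavsky, 1991).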

Each partition $A_k$ is thus assigned a value given by (31) which is used in
a statistical hypothesis test (similar to the one described in Section 4.2)
that tests whether the optimal state belongs to $A_k$. The partition with the
highest value of (31), and hence the lowest p-value, is eliminated from the
state space. The process is then repeated on the remaining set of states, i.e.,
a new set of $k$ partitions is established, SA is executed on this smaller
domain, and the necessary statistics computed. The procedure is repeated until
termination occurs.
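The ranking step can be sketched as a small function that turns one run's output into the estimate (31) for every partition. The function name `prospectiveness`, the integer partition labels and the six-sample data set are hypothetical and chosen only so that the arithmetic can be followed by hand. No hypothesis test is performed here; the sketch only ranks.

```python
from collections import defaultdict

def prospectiveness(labels, f_values, t):
    """Estimate (31) for each partition from one fixed-temperature run.

    labels[j] is the partition containing the state visited at iteration j,
    f_values[j] is the objective function value recorded at that iteration.
    """
    m = len(f_values)
    f_omega_hat = sum(f_values) / m                 # estimate of f_Omega
    groups = defaultdict(list)
    for label, value in zip(labels, f_values):
        groups[label].append(value)
    criterion = {}
    for label, values in groups.items():
        pi_hat = len(values) / m                    # estimator (32)
        f_hat = sum(values) / len(values)           # conditional mean, as in (30)
        criterion[label] = pi_hat / t ** 2 * (f_hat - f_omega_hat)
    return criterion

# Hypothetical output of a short run: three partitions, two visits each.
labels = [0, 0, 1, 1, 2, 2]
f_values = [0.0, 1.0, 2.0, 3.0, 5.0, 6.0]

criterion = prospectiveness(labels, f_values, t=1.0)
eliminated = max(criterion, key=criterion.get)      # highest value of (31)
print(criterion, eliminated)
```

With these numbers the estimates are $-7/9$, $-1/9$ and $8/9$, so partition 2 would be the candidate for elimination. The estimates sum to zero because the estimated probabilities of the partitions sum to one.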

Note  that  the  estimate  of  the  rate  of  change  can  be  obtained  without  actu¬ 
ally  changing  the  temperature.  In  effect,  this  allows  one  to  use  perturbation 
analysis  to  determine  those  partitions  in  which  the  stationary  probability  is 
either  increasing  or  decreasing,  hence  whether  it  is  likely  to  contain  an  opti¬ 
mal  state  (Fu  and  Hu,  1992).  Instead  of  running  the  algorithm  at  temperature 
t,  obtaining  statistics,  and  rerunning  the  algorithm  at  the  lower  temperature 
t  —  e,  a  single  execution  can  be  used  to  estimate  the  rate  of  change  of  the 
stationary  probability  of  each  partition. 
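The difference between the two approaches can be illustrated with exact stationary probabilities. The sketch below is an assumption-laden toy: the five objective function values, the partition `A`, the temperature and the perturbation `eps` are hypothetical, and exact Boltzmann probabilities stand in for SA estimates. It compares a two-temperature difference quotient with the single-temperature expression $(\pi_A/t^2)(f_A - f_\Omega)$.

```python
import math

def lumped(f, t, members):
    """Return (pi_A, f_A, f_Omega) at temperature t for the partition A = members."""
    w = [math.exp(-fi / t) for fi in f]
    z = sum(w)
    w_A = sum(w[i] for i in members)
    f_A = sum(w[i] * f[i] for i in members) / w_A
    f_Omega = sum(wi * fi for wi, fi in zip(w, f)) / z
    return w_A / z, f_A, f_Omega

f = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical objective function values
A = [0, 3]                       # hypothetical partition containing the optimum
t, eps = 1.5, 1e-4

# Two executions: stationary probability at t and at the lower temperature t - eps.
two_run = (lumped(f, t, A)[0] - lumped(f, t - eps, A)[0]) / eps

# One execution: every quantity is evaluated at the single temperature t.
pi_A, f_A, f_Omega = lumped(f, t, A)
one_run = pi_A / t ** 2 * (f_A - f_Omega)

print(two_run, one_run)
```

Both numbers are negative and agree closely, indicating that the probability of this partition grows as the temperature is lowered.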

This  procedure  recursively  shrinks  the  state  space.  If  the  probability  of  a
Type  I  error  is  sufficiently  small,  this  procedure  may  provide  an  efficient  way 
of  searching  for  optimal  states.  It  is  worth  noting  that  in  some  sense,  there  is 
an  equivalence  between  annealing  by  lowering  the  temperature  and  annealing 
by  successively  reducing  the  search  space. 
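The recursive shrinking of the state space can be sketched as a loop. This is an idealized illustration rather than the procedure itself: it evaluates the rate of change exactly from Boltzmann weights on the remaining domain instead of estimating it from SA output, it skips the hypothesis test, and the quadratic objective function, the three contiguous index blocks and the temperature are hypothetical choices.

```python
import math

def rate_of_change(f, domain, members, t):
    """(pi_A / t^2) (f_A - f_Omega), with Omega restricted to the current domain."""
    z = sum(math.exp(-f[i] / t) for i in domain)
    w_A = sum(math.exp(-f[i] / t) for i in members)   # assumed not to underflow
    f_A = sum(math.exp(-f[i] / t) * f[i] for i in members) / w_A
    f_Omega = sum(math.exp(-f[i] / t) * f[i] for i in domain) / z
    return (w_A / z) / t ** 2 * (f_A - f_Omega)

def shrink(f, t, parts=3):
    """Repeatedly partition the domain and drop the block with the highest rate."""
    domain = list(range(len(f)))
    while len(domain) > 1:
        size = math.ceil(len(domain) / parts)
        blocks = [domain[i:i + size] for i in range(0, len(domain), size)]
        worst = max(blocks, key=lambda b: rate_of_change(f, domain, b, t))
        domain = [i for i in domain if i not in worst]
    return domain

# Hypothetical objective with a single minimum at state 6.
f = [float((i - 6) ** 2) for i in range(16)]
remaining = shrink(f, t=2.0)
print(remaining)
```

In this instance the loop removes a block in each of five rounds and ends with the single state 6, the minimizer, while the temperature is held at $t = 2$ throughout.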


5.  Future  Research  and  Conclusion 

This  article  has  explored  various  scale  invariance  properties  in  the  SA  al¬ 
gorithm.  The  type  of  scale  invariance  examined  here  was  based  on  scaling 
the  size  of  the  state  space  by  lumping  states  into  aggregate  states.  Analysis 
of  the  SA  algorithm  with  respect  to  these  aggregated  states  indicates  a  form 
of  scale  invariance  because  the  aggregated  states  exhibit  similar  behaviors 
as  individual  states.  When  these  aggregated  states  are  assigned  an  objective 
function  value  based  on  a  conditional  expectation  value,  various  relationships
are  preserved  between  their  steady-state  probability  and  the  expected
objective  function  value.  This  produces  a  number  of  relational  properties 
such  as  Objective  Function  Complementarity  (Lemma  1  and  its  corollaries) 
and  monotonicity  (Theorems  1  and  2).  Scale  invariance  properties  in  second
moments  were  also  described,  as  well  as  the  relationship  between  the  rate  of
change  of  the  expected  objective  function  value  with  respect  to  temperature
and  the  variance  of  objective  function  values.  These
results  collectively  suggest  that  groups  of  nodes  or  sets  of  states  can  be  treated 
or  viewed  as  single  states.  Properties  such  as  convergence  in  probability  to  a 
state  can  therefore  be  extended  to  convergence  in  probability  to  a  set  of  states. 

Scale  invariance  provides  a  new  way  of  viewing  the  SA  algorithm  and 
provides  a  solid  basis  for  new  research  into  methodologies,  applications,  and 
implementations  of  SA  that  take  advantage  of  this  property.  Potential  appli¬ 
cations  include  using  these  scale  invariance  properties  in  analysis.  Because 
scale  invariance  properties  also  relate  the  variance  of  objective  function  val¬ 
ues  to  other  quantities,  statistical  inference  is  possible.  SISA  provides  for 
the  possibility  of  making  inferences  about  the  value  of  any  specified  de¬ 
cision  variable  without  necessarily  obtaining  the  optimal  solution.  Finally, 
applications  of  scale  invariance  in  optimization  were  described. 

Applications in optimization are based on the notion of recursive functionals
where subsets of nodes constituting a sub-domain are partitioned into
mutually exclusive subsets. The subset least likely to contain the optimum, as
indicated by a prospectiveness criterion, is then excluded and SA re-executed
on the remaining states. This is similar to nested partition algorithms and
branch and probability bound algorithms. The approach of scaling the state
space can be especially advantageous for continuous problems where this
process can be repeated ad infinitum to induce SA to converge to a small
neighborhood of an optimal state. SA can therefore be used recursively to
yield a more efficient use of computer resources, and hence, improve the
finite-time performance of SA.

For  scale  invariance  to  achieve  its  true  potential,  however,  these  areas 
of  application  in  analysis,  statistical  inference  and  optimization  all  require 
further  investigation.  In  SISA,  research  into  the  distribution  of  test  statistics 
for  finite-time  executions  of  SA  would  assist  in  determining  the  efficacy  of 
SISA  on  particular  problems.  Experimentation  may  also  shed  light  on  how 
best  to  implement  SISA  concepts.  Such  research  would  also  support  efforts 
to  use  scale  invariance  properties  in  optimization. 

The potential use of scale invariance in analysis hints at new discoveries
to come. As is often the case, new patterns, such as those exhibited by
scale invariance, require some time to germinate before their true potential
is achieved. Connections between SA and Markov random fields (see e.g.,
(Boykov and Zabih, 1998; Li, 1995)) may provide entirely new ways of
analyzing and solving complex problems.

The  SA  algorithm  has  been  used  to  solve  numerous  hard  discrete  opti¬ 
mization  problems.  SA  has  been  framed  as  a  “meta-heuristic”  owing  to  its 
generality.  Viewing  it  strictly  as  an  algorithm,  however,  imposes  a  limited 
perspective  on  SA  and  diminishes  its  significance.  Thus,  rather  than  view¬ 
ing  it  strictly  as  an  algorithm,  SA  should  be  used  as  a  tool  for  modelling 
the  dynamics  and  complexity  associated  with  discrete  optimization  prob¬ 
lems.  This  perspective  unleashes  the  hidden  value  of  SA:  the  analogies  it 
draws  between  discrete  optimization  problems,  information  theory,  and  ther¬ 
modynamics  (Fleischer  and  Jacobson,  1999).  The  scale  invariance  properties 
examined  in  this  paper  illustrate  only  a  few  of  the  many  potential  connec¬ 
tions  between  these  areas  of  inquiry.  Other  ways  to  take  advantage  of  the 
scale  invariance  described  here  and  further  development  of  the  connections 
to  information  theory  and  thermodynamics  are  possible.  Our  hope  is  that  this
paper  will  encourage  similar  discoveries  in  this  remarkable  algorithm.